PatchMixer: Rethinking network design to boost generalization for 3D point cloud understanding
The recent trend in deep learning methods for 3D point cloud understanding is
to propose increasingly sophisticated architectures that either better capture 3D
geometries or introduce possibly undesired inductive biases. Moreover,
prior works introducing novel architectures compared their performance on the
same domain, devoting less attention to their generalization to other domains.
We argue that the ability of a model to transfer the learnt knowledge to
different domains is an important feature that should be evaluated to
exhaustively assess the quality of a deep network architecture. In this work we
propose PatchMixer, a simple yet effective architecture that extends the ideas
behind the recent MLP-Mixer paper to 3D point clouds. The novelties of our
approach are the processing of local patches instead of the whole shape to
promote robustness to partial point clouds, and the aggregation of patch-wise
features using an MLP as a simpler alternative to the graph convolutions or the
attention mechanisms that are used in prior works. We evaluated our method on
the shape classification and part segmentation tasks, achieving superior
generalization performance compared to a selection of the most relevant deep
architectures.Comment: Published in the Image and Vision Computing journa
Distinctive 3D local deep descriptors
We present a simple yet effective method for learning distinctive 3D
local deep descriptors (DIPs) that can be used to register point clouds without
requiring an initial alignment. Point cloud patches are extracted,
canonicalised with respect to their estimated local reference frame and encoded
into rotation-invariant compact descriptors by a PointNet-based deep neural
network. DIPs can effectively generalise across different sensor modalities
because they are learnt end-to-end from locally and randomly sampled points.
Because DIPs encode only local geometric information, they are robust to
clutter, occlusions and missing regions. We evaluate and compare DIPs against
alternative hand-crafted and deep descriptors on several indoor and outdoor
datasets consisting of point clouds reconstructed using different sensors.
Results show that DIPs (i) achieve comparable results to the state-of-the-art
on RGB-D indoor scenes (3DMatch dataset), (ii) outperform state-of-the-art by a
large margin on laser-scanner outdoor scenes (ETH dataset), and (iii)
generalise to indoor scenes reconstructed with the Visual-SLAM system of
Android ARCore. Source code: https://github.com/fabiopoiesi/dip
Comment: IEEE International Conference on Pattern Recognition 2020.
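A minimal sketch of the two stages described above, assuming PyTorch; the PCA-based canonicalisation below is a simplification of the paper's estimated local reference frame, and the network is a toy stand-in for the actual PointNet-based encoder:

```python
# Sketch of the DIP pipeline on one patch: canonicalise points with an
# estimated reference frame, then encode with a PointNet-style network.
import torch
import torch.nn as nn

def canonicalise(patch: torch.Tensor) -> torch.Tensor:
    # patch: (N, 3) points sampled around a keypoint
    centred = patch - patch.mean(dim=0)
    # Eigenvectors of the covariance give a rotation into a canonical frame.
    cov = centred.T @ centred / patch.shape[0]
    _, eigvecs = torch.linalg.eigh(cov)
    return centred @ eigvecs  # rotation-normalised patch

class MiniPointNet(nn.Module):
    def __init__(self, out_dim: int = 32):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, out_dim))
    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # Per-point MLP followed by a permutation-invariant max pool.
        return self.point_mlp(pts).max(dim=0).values

patch = torch.randn(256, 3)
descriptor = MiniPointNet()(canonicalise(patch))  # compact local descriptor
```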
Multi-target tracking and performance evaluation on videos
PhD thesis.
Multi-target tracking is the process of extracting object motion patterns of
interest from a scene. Motion patterns are often described through metadata representing object
locations and shape information. In the first part of this thesis we discuss the state-of-the-art
methods aimed at accomplishing this task on monocular views and also analyse the methods for
evaluating their performance. The second part of the thesis describes our research contribution
to these topics.
We begin by presenting a method for multi-target tracking based on track-before-detect (MT-TBD),
formulated as a particle filter. The novelty lies in the inclusion of the target identity
(ID) in the particle state, which enables the algorithm to deal with an unknown and unlimited
number of targets. We propose a probabilistic model of particle birth and death based on Markov
Random Fields. This model allows us to overcome the problem of the mixing of IDs of close
targets.
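As a toy illustration of this idea (field names are hypothetical, not taken from the thesis), a single particle set can carry several identities when the ID is part of the state:

```python
# Illustrative MT-TBD particle state: the target ID lives inside the
# state, so one filter can represent an unknown number of targets.
import random
from dataclasses import dataclass

@dataclass
class Particle:
    x: float          # position on the image plane
    y: float
    vx: float         # velocity
    vy: float
    target_id: int    # identity carried inside the state
    weight: float = 1.0

def predict(p: Particle, noise: float = 1.0) -> Particle:
    """Constant-velocity motion model with Gaussian diffusion."""
    return Particle(p.x + p.vx + random.gauss(0, noise),
                    p.y + p.vy + random.gauss(0, noise),
                    p.vx, p.vy, p.target_id, p.weight)

# Particles with different IDs coexist in the same filter; birth/death
# moves (modelled with Markov Random Fields in the thesis) add or remove
# IDs as targets appear and disappear.
cloud = [Particle(10, 20, 1, 0, target_id=0),
         Particle(50, 60, -1, 0, target_id=1)]
cloud = [predict(p) for p in cloud]
```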
We then propose three evaluation measures that take into account target-size variations, combine
accuracy and cardinality errors, quantify long-term tracking accuracy at different accuracy
levels, and evaluate ID changes relative to the duration of the track in which they occur. This
set of measures does not require pre-setting of parameters and allows one to holistically evaluate
tracking performance in an application-independent manner.
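For intuition only, the following illustrative score (not one of the thesis's actual measures) shows how localisation accuracy and cardinality errors can be folded into a single number; the gate tau is an assumption:

```python
# Illustrative per-frame score combining accuracy and cardinality errors.
import numpy as np

def frame_score(gt: np.ndarray, est: np.ndarray, tau: float = 20.0) -> float:
    """gt: (N, 2) and est: (M, 2) target centroids in pixels."""
    n, m = len(gt), len(est)
    if n == 0 and m == 0:
        return 1.0
    if n == 0 or m == 0:
        return 0.0
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=-1)
    matched, errs = set(), []
    for i in range(n):  # greedy nearest-neighbour matching, gated at tau
        j = int(np.argmin(dists[i]))
        if j not in matched and dists[i, j] < tau:
            matched.add(j)
            errs.append(dists[i, j])
    acc = float(np.mean([1 - e / tau for e in errs])) if errs else 0.0
    card = len(errs) / max(n, m)  # penalises misses and false targets
    return acc * card

gt = np.array([[10.0, 10.0], [40.0, 40.0]])
est = np.array([[12.0, 9.0]])
print(frame_score(gt, est))  # accurate match, but penalised for the miss
```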
Lastly, we present a framework for multi-target localisation applied on scenes with a high
density of compact objects. Candidate target locations are initially generated by extracting object
features from intensity maps using an iterative method based on a gradient-climbing technique
and an isocontour slicing approach. A graph-based data association method for multi-target
tracking is then applied to link valid candidate target locations over time and to discard those
which are spurious. This method can deal with point targets having indistinguishable appearance
and unpredictable motion.
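A toy version of the candidate-generation step, with hill climbing on an intensity map standing in for the thesis's gradient-climbing and isocontour-slicing procedure:

```python
# Candidate target locations as local peaks of an intensity map.
import numpy as np

def local_maxima(intensity: np.ndarray, min_value: float):
    """Hill-climb from every pixel to a local peak of the intensity map."""
    h, w = intensity.shape
    peaks = set()
    for y in range(h):
        for x in range(w):
            cy, cx = y, x
            while True:
                y0, x0 = max(cy - 1, 0), max(cx - 1, 0)
                window = intensity[y0:cy + 2, x0:cx + 2]
                dy, dx = np.unravel_index(np.argmax(window), window.shape)
                ny, nx = y0 + dy, x0 + dx
                if intensity[ny, nx] <= intensity[cy, cx]:
                    break  # reached a peak (or a plateau)
                cy, cx = ny, nx
            if intensity[cy, cx] >= min_value:
                peaks.add((cy, cx))
    return sorted(peaks)

heat = np.random.rand(64, 64)          # stand-in for an intensity map
candidates = local_maxima(heat, 0.9)   # candidate target locations
```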
MT-TBD is evaluated and compared with state-of-the-art methods on real-world surveillance videos.
This work was supported by the EU, under the FP7 project APIDIS (ICT-216023), and by the
Artemis JU and TSB as part of the COPCAMS project (332913).
Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation
Recent works on 6D object pose estimation focus on learning keypoint
correspondences between images and object models, and then determine the object
pose through RANSAC-based algorithms or by directly regressing the pose with
end-to-end optimisations. We argue that learning point-level discriminative
features is overlooked in the literature. To this end, we revisit Fully
Convolutional Geometric Features (FCGF) and tailor it for object 6D pose
estimation to achieve state-of-the-art performance. FCGF employs sparse
convolutions and learns point-level features using a fully-convolutional
network by optimising a hardest contrastive loss. We can outperform recent
competitors on popular benchmarks by adopting key modifications to the loss and
to the input data representations, by carefully tuning the training strategies,
and by employing data augmentations suitable for the underlying problem. We
carry out a thorough ablation to study the contribution of each modification.
Comment: 17 pages. Preprint, currently under review.
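For reference, a compact PyTorch sketch of a hardest-contrastive loss of the kind FCGF optimises; the margins and the simple in-batch negative mining are illustrative, not the paper's exact settings:

```python
# Hardest-contrastive loss over corresponding point features.
import torch

def hardest_contrastive_loss(fa: torch.Tensor, fb: torch.Tensor,
                             m_pos: float = 0.1,
                             m_neg: float = 1.4) -> torch.Tensor:
    """fa, fb: (N, D) features of N corresponding points in two scans.
    Row i of fa matches row i of fb; every other row is a negative."""
    d = torch.cdist(fa, fb)                 # (N, N) pairwise distances
    pos = d.diagonal()                      # distances of true matches
    # Mask positives out, then take each anchor's hardest negative.
    d_masked = d + torch.eye(len(fa), device=d.device) * 1e6
    hardest_a = d_masked.min(dim=1).values  # for anchors in scan A
    hardest_b = d_masked.min(dim=0).values  # for anchors in scan B
    loss_pos = torch.clamp(pos - m_pos, min=0).pow(2).mean()
    loss_neg = (torch.clamp(m_neg - hardest_a, min=0).pow(2).mean() +
                torch.clamp(m_neg - hardest_b, min=0).pow(2).mean()) / 2
    return loss_pos + loss_neg

fa, fb = torch.randn(128, 32), torch.randn(128, 32)
loss = hardest_contrastive_loss(fa, fb)
```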
Data augmentation for NeRF: a geometric consistent solution based on view morphing
NeRF aims to learn a continuous neural scene representation by using a finite
set of input images taken from different viewpoints. The fewer the number of
viewpoints, the higher the likelihood of overfitting on them. This paper
mitigates this limitation by presenting a novel data augmentation approach to
generate geometrically consistent image transitions between viewpoints using
view morphing. View morphing is a highly versatile technique that does not
require any prior knowledge of the 3D scene because it is based on general
principles of projective geometry. A key novelty of our method is to use the
very same depths predicted by NeRF to generate the image transitions that are
then added to NeRF training. We experimentally show that this procedure enables
NeRF to improve the quality of its synthesised novel views in the case of
datasets with few training viewpoints. We improve PSNR by up to 1.8 dB and 10.5 dB
when eight and four views are used for training, respectively. To the best of
our knowledge, this is the first data augmentation strategy for NeRF that
explicitly synthesises additional new input images to improve the model
generalisation.
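A geometry-only sketch of the core step under a pinhole camera model: pixels of a training view are back-projected with the NeRF-predicted depth and re-projected into an interpolated camera pose (the full method uses view morphing; the names and the naive pose interpolation below are ours):

```python
# Warping a training view to an intermediate pose using rendered depth.
import numpy as np

def reproject(depth, K, pose_src, pose_dst):
    """depth: (H, W) NeRF-rendered depth; K: (3, 3) intrinsics;
    poses: (4, 4) camera-to-world. Returns pixel coords in the new view."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)  # camera pts
    pts_h = np.concatenate([rays, np.ones((rays.shape[0], 1))], axis=1)
    world = (pose_src @ pts_h.T).T                      # to world frame
    cam = (np.linalg.inv(pose_dst) @ world.T).T[:, :3]  # to target camera
    proj = (K @ cam.T).T
    return proj[:, :2] / proj[:, 2:3]                   # new pixel coords

def midpoint_pose(p0, p1, t=0.5):
    """Naive interpolation: blends translations only; a real pipeline
    would also interpolate rotations (e.g. with slerp)."""
    mid = p0.copy()
    mid[:3, 3] = (1 - t) * p0[:3, 3] + t * p1[:3, 3]
    return mid

depth = np.full((4, 4), 2.0)
K = np.array([[50.0, 0.0, 2.0], [0.0, 50.0, 2.0], [0.0, 0.0, 1.0]])
pose_a, pose_b = np.eye(4), np.eye(4)
pose_b[0, 3] = 0.2
uv = reproject(depth, K, pose_a, midpoint_pose(pose_a, pose_b))
```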
Compositional Semantic Mix for Domain Adaptation in Point Cloud Segmentation
Deep-learning models for 3D point cloud semantic segmentation exhibit limited
generalization capabilities when trained and tested on data captured with
different sensors or in varying environments due to domain shift. Domain
adaptation methods can be employed to mitigate this domain shift, for instance,
by simulating sensor noise, developing domain-agnostic generators, or training
point cloud completion networks. Often, these methods are tailored for range
view maps or necessitate multi-modal input. In contrast, domain adaptation in
the image domain can be executed through sample mixing, which emphasizes input
data manipulation rather than employing distinct adaptation modules. In this
study, we introduce compositional semantic mixing for point cloud domain
adaptation, representing the first unsupervised domain adaptation technique for
point cloud segmentation based on semantic and geometric sample mixing. We
present a two-branch symmetric network architecture capable of concurrently
processing point clouds from a source domain (e.g. synthetic) and point clouds
from a target domain (e.g. real-world). Each branch operates within one domain
by integrating selected data fragments from the other domain and utilizing
semantic information derived from source labels and target (pseudo) labels.
Additionally, our method can leverage a limited number of human point-level
annotations (semi-supervised) to further enhance performance. We assess our
approach in both synthetic-to-real and real-to-real scenarios using LiDAR
datasets and demonstrate that it significantly outperforms state-of-the-art
methods in both unsupervised and semi-supervised settings.
Comment: TPAMI. arXiv admin note: text overlap with arXiv:2207.0977
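The mixing idea can be sketched in a few lines (illustrative only; the paper's method additionally runs a two-branch symmetric network and uses pseudo-labels on the target domain):

```python
# Semantic sample mixing between domains: points of selected classes are
# cut from one cloud and pasted, with their labels, into the other.
import numpy as np

def semantic_mix(src_pts, src_lbl, tgt_pts, tgt_lbl, classes):
    """src/tgt_pts: (N, 3) points; src/tgt_lbl: (N,) semantic labels
    (target labels may be pseudo-labels); classes: labels to transplant."""
    keep = np.isin(src_lbl, classes)
    mixed_pts = np.concatenate([tgt_pts, src_pts[keep]], axis=0)
    mixed_lbl = np.concatenate([tgt_lbl, src_lbl[keep]], axis=0)
    return mixed_pts, mixed_lbl

src = np.random.rand(1000, 3)
src_y = np.random.randint(0, 5, 1000)          # source labels
tgt = np.random.rand(800, 3)
tgt_y = np.random.randint(0, 5, 800)           # target pseudo-labels
mix_pts, mix_lbl = semantic_mix(src, src_y, tgt, tgt_y, classes=[1, 3])
```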
Survey on video anomaly detection in dynamic scenes with moving cameras
The increasing popularity of compact and inexpensive cameras, e.g. dash
cameras, body cameras, and cameras equipped on robots, has sparked a growing
interest in detecting anomalies within dynamic scenes recorded by moving
cameras. However, existing reviews primarily concentrate on Video Anomaly
Detection (VAD) methods assuming static cameras. The VAD literature with moving
cameras remains fragmented, lacking comprehensive reviews to date. To address
this gap, we endeavor to present the first comprehensive survey on Moving
Camera Video Anomaly Detection (MC-VAD). We delve into the research papers
related to MC-VAD, critically assessing their limitations and highlighting
associated challenges. Our exploration encompasses three application domains:
security, urban transportation, and marine environments, which in turn cover
six specific tasks. We compile an extensive list of 25 publicly-available
datasets spanning four distinct environments: underwater, water surface,
ground, and aerial. We summarize the types of anomalies these datasets
correspond to or contain, and present five main categories of approaches for
detecting such anomalies. Lastly, we identify future research directions and
discuss novel contributions that could advance the field of MC-VAD. With this
survey, we aim to offer a valuable reference for researchers and practitioners
striving to develop and advance state-of-the-art MC-VAD methods.
Comment: Under review.
Novel-View Human Action Synthesis
Novel-View Human Action Synthesis aims to synthesize the movement of a body
from a virtual viewpoint, given a video from a real viewpoint. We present a
novel 3D reasoning approach to synthesize the target viewpoint. We first estimate the 3D
mesh of the target body and transfer the rough textures from the 2D images to
the mesh. As this transfer may generate sparse textures on the mesh due to
frame resolution or occlusions, we produce a semi-dense textured mesh by
propagating the transferred textures both locally, within local geodesic
neighborhoods, and globally, across symmetric semantic parts. Next, we
introduce a context-based generator to learn how to correct and complete the
residual appearance information. This allows the network to independently focus
on learning the foreground and background synthesis tasks. We validate the
proposed solution on the public NTU RGB+D dataset. The code and resources are
available at https://bit.ly/36u3h4K.
Comment: Asian Conference on Computer Vision (ACCV) 2020.
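A simplified sketch of the texture-transfer step under a pinhole model; the nearest-neighbour propagation here is a crude stand-in for the paper's geodesic and symmetric propagation:

```python
# Transferring rough textures from a 2D frame to mesh vertices.
import numpy as np

def transfer_textures(verts_cam, K, image):
    """verts_cam: (V, 3) mesh vertices in camera coordinates;
    K: (3, 3) intrinsics; image: (H, W, 3) source frame in [0, 1]."""
    H, W, _ = image.shape
    proj = (K @ verts_cam.T).T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    visible = ((uv[:, 0] >= 0) & (uv[:, 0] < W) &
               (uv[:, 1] >= 0) & (uv[:, 1] < H) & (verts_cam[:, 2] > 0))
    colours = np.zeros((len(verts_cam), 3))
    colours[visible] = image[uv[visible, 1], uv[visible, 0]]
    # Untextured vertices borrow from the nearest textured vertex in 3D.
    idx = np.flatnonzero(visible)
    if idx.size:
        for i in np.flatnonzero(~visible):
            j = idx[np.argmin(np.linalg.norm(verts_cam[idx] - verts_cam[i],
                                             axis=1))]
            colours[i] = colours[j]
    return colours

verts = np.random.rand(200, 3) + np.array([0.0, 0.0, 1.0])  # in front of cam
frame = np.random.rand(64, 64, 3)
K = np.array([[60.0, 0.0, 32.0], [0.0, 60.0, 32.0], [0.0, 0.0, 1.0]])
vertex_colours = transfer_textures(verts, K, frame)
```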
Cloud-based collaborative 3D reconstruction using smartphones
This article presents a pipeline that enables multiple users to collaboratively acquire images with monocular smartphones and derive a 3D point cloud using a remote reconstruction server. A set of key images is automatically selected from each smartphone’s camera video feed as multiple users record different viewpoints of an object, concurrently or at different time instants. Selected images are automatically processed and registered with an incremental Structure from Motion (SfM) algorithm in order to create a 3D model. Our incremental SfM approach enables on-the-fly feedback about the current reconstruction progress to be generated for the user. Feedback is provided in the form of a preview window showing the current 3D point cloud, enabling users to see if parts of a surveyed scene need further attention/coverage whilst they are still in situ. We evaluate our 3D reconstruction pipeline by performing experiments in uncontrolled and unconstrained real-world scenarios. Datasets are publicly available.
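As a hedged sketch of the key-image selection step (the paper's actual criterion is not reproduced here), a frame can be promoted to key image when it differs sufficiently from the last one:

```python
# Toy key-image selection from a smartphone video feed.
import numpy as np

def select_key_images(frames, threshold=0.15):
    """frames: iterable of (H, W) grayscale arrays in [0, 1]."""
    keys = []
    last = None
    for frame in frames:
        if last is None or np.abs(frame - last).mean() > threshold:
            keys.append(frame)   # would be sent to the reconstruction server
            last = frame
    return keys

video = [np.random.rand(48, 64) for _ in range(30)]
key_images = select_key_images(video)  # registered by incremental SfM
```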
1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge Results
The 1st Workshop on Maritime Computer Vision (MaCVi) 2023 focused
on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned
Surface Vehicles (USV), and organized several subchallenges in this domain: (i)
UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking,
(iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime
Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS
benchmarks. This report summarizes the main findings of the individual
subchallenges and introduces a new benchmark, called SeaDronesSee Object
Detection v2, which extends the previous benchmark by including more classes
and footage. We provide statistical and qualitative analyses, and assess trends
in the best-performing methodologies of over 130 submissions. The methods are
summarized in the appendix. The datasets, evaluation code and the leaderboard
are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi
Comment: MaCVi 2023 was part of WACV 2023. This report (38 pages) discusses
the competition as part of MaCVi.